Big Data

Big Data is a term used to describe collections of data so huge and complex that they become very tedious to capture, store, process, retrieve, and analyze. Such data is huge in size and keeps growing exponentially with time; in short, it is so large and complex that none of the traditional data management tools can store or process it efficiently. A number of technologies exist to ingest and run analytical queries over Big Data (i.e. large volumes of data), and Big Data is used in Business Intelligence (BI) reporting, Data Science, Machine Learning, and Artificial Intelligence (AI). Processing a large volume of data is intensive on disk I/O, CPU, and memory. Big Data processing is therefore based on distributed computing, where multiple nodes across several machines process the data in parallel. Each node has its own dedicated hard disk, CPU, memory, etc. This is known as the “shared-nothing architecture”. A Hadoop cluster is a collection of such nodes.
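To make the shared-nothing idea concrete, here is a minimal single-machine sketch in Python: each worker process gets its own partition of the data (no memory shared between workers), computes a partial result independently, and the partial results are combined at the end. The data set and function names here are invented for illustration, not taken from any real Hadoop API.

```python
# Sketch of shared-nothing parallel processing: each worker process
# owns its own partition and computes a partial result independently.
from multiprocessing import Pool

def sum_partition(partition):
    """Each 'node' processes only its own partition of the data."""
    return sum(partition)

def parallel_sum(data, workers=4):
    # Split the data into one partition per worker, like data blocks
    # living on separate nodes in a cluster.
    partitions = [data[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(sum_partition, partitions)
    return sum(partials)  # combine the partial results

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # 499500
```

In a real cluster the partitions would live on different machines and the combine step would happen over the network, but the data flow is the same.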

Here are a few technologies you can use:

Hadoop is suitable for massive offline batch processing, whereas MPP databases like Amazon Redshift are built for online analytics.
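Hadoop's batch model is usually explained as MapReduce: a map phase emits key/value pairs, a shuffle groups pairs by key, and a reduce phase aggregates each group. The following is a toy single-process word count sketching only the data flow; real Hadoop runs these phases distributed across many nodes.

```python
# Toy MapReduce-style word count: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```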

Example of Big Data
  • The New York Stock Exchange generates about one terabyte of new trade data per day.
  • Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day.
  • A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
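As a back-of-the-envelope check on the jet-engine figure above, we can scale 10 TB per 30-minute flight up to a day of global air traffic. The flights-per-day number below is an assumed round figure for illustration only:

```python
# Rough scaling of per-flight data volume to a global daily total.
TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte

per_flight_bytes = 10 * TB    # ~10 TB per 30-minute flight (from the text)
flights_per_day = 25_000      # assumed round figure, for illustration
daily_bytes = per_flight_bytes * flights_per_day
print(daily_bytes / PB)       # 250.0 petabytes per day
```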
Types Of Data
Data could be found in three forms:
  • Structured
  • Unstructured
  • Semi-structured
Structured
Structured data usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making them a simple matter to search. Data may be human- or machine-generated, as long as it is created within an RDBMS structure. This format is eminently searchable, both with human-generated queries and via algorithms that use the type of data and field names, such as alphabetical or numeric, currency or date.
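A small example of what this looks like in practice, using Python's built-in sqlite3 module; the table and rows are invented for illustration. Because every field has a known name and type, queries can target specific fields directly:

```python
# Structured data in an RDBMS: fixed, typed fields, searchable by name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id        INTEGER PRIMARY KEY,
        name      TEXT,
        zip_code  TEXT,
        phone     TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers (name, zip_code, phone) VALUES (?, ?, ?)",
    [("Alice", "10001", "212-555-0100"),
     ("Bob",   "94105", "415-555-0101")],
)
# The known schema makes searching a specific field trivial.
rows = conn.execute(
    "SELECT name FROM customers WHERE zip_code = ?", ("10001",)
).fetchall()
print(rows)  # [('Alice',)]
conn.close()
```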
 
Unstructured
Unstructured data is essentially everything else. It has internal structure but is not organized via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational database such as NoSQL. Typical human-generated unstructured data includes:
  • Text files: Word processing, spreadsheets, presentations, email, logs.
  • Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
  • Social Media: Data from Facebook, Twitter, LinkedIn.
  • Website: YouTube, Instagram, photo sharing sites.
  • Mobile data: Text messages, locations.
  • Communications: Chat, IM, phone recordings, collaboration software.
  • Media: MP3, digital photos, audio and video files.
  • Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:
  • Satellite imagery: Weather data, land forms, military movements.
  • Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
  • Digital surveillance: Surveillance photos and video.
  • Sensor data: Traffic, weather, oceanographic sensors.

Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
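For instance, an XML document carries its own tags, so the data describes its structure itself, yet no relational table definition exists for it. The document below is made up; it is parsed with the standard-library xml.etree module:

```python
# Semi-structured data: self-describing tags, but no fixed schema.
import xml.etree.ElementTree as ET

doc = """
<employees>
  <employee id="1">
    <name>Alice</name>
    <role>Engineer</role>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <!-- this record has no role element, which a fixed relational
         schema would normally not allow -->
  </employee>
</employees>
"""

root = ET.fromstring(doc)
names = [e.findtext("name") for e in root.findall("employee")]
print(names)  # ['Alice', 'Bob']
```

Note how the second record simply omits a field; that flexibility is exactly what distinguishes semi-structured from structured data.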

Benefits of Big Data Processing
Businesses can utilize outside intelligence while making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
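Real systems of this kind use full natural language processing models; purely to illustrate the idea of scoring consumer responses automatically, here is a toy keyword tally. The word lists are invented for illustration:

```python
# Toy response scoring: count positive vs negative keywords.
# (A real system would use an NLP model, not word lists.)
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "broken", "terrible", "slow"}

def score_response(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score_response("Great product, love it"))               # 2
print(score_response("Arrived broken and support was slow"))  # -2
```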

Early identification of risk to the product/services, if any

Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization offload infrequently accessed data.

Summary
  • Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
  • Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc.
  • Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
  • Volume, Velocity, Variety, Veracity, and Value are a few characteristics of Big Data
  • Improved customer service, better operational efficiency, and better decision making are a few advantages of Big Data
Characteristics Of Big Data
In recent years, Big Data was defined by the “3Vs”, but there are now “5Vs” of Big Data, which are also termed the characteristics of Big Data, as follows:

Volume:- The name ‘Big Data’ itself relates to a size that is enormous. Volume refers to the huge amount of data. The size of data plays a crucial role in determining its value; only when the volume of data is very large is it actually considered ‘Big Data’. When dealing with Big Data, it is necessary to consider the ‘Volume’ characteristic. Example: In 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month, and by 2020 the world was projected to hold almost 40,000 exabytes of data.

Velocity:- Velocity refers to the high speed at which data accumulates. In Big Data, data flows in from sources like machines, networks, social media, mobile phones, etc. There is a massive and continuous flow of data, which determines how fast data is generated and must be processed to meet demand. Sampling data can help in dealing with velocity. Example: More than 3.5 billion searches per day are made on Google, and Facebook users are increasing by approximately 22% year over year.

Variety:- Variety refers to the nature of the data: structured, semi-structured, and unstructured data from homogeneous or heterogeneous sources. Big Data may not belong to a specific format; it can be in any form, such as structured data, unstructured text, images, audio, video, log files, emails, simulations, 3D models, etc. Research shows that a substantial amount of an organization’s data is not numeric, yet such data is equally important to the decision-making process. So organizations need to think beyond stock records, documents, personnel files, finances, etc.

Veracity:- Veracity refers to inconsistencies and uncertainty in the data: available data can get messy, and its quality and accuracy are difficult to control. Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Example: data in bulk can create confusion, whereas too little data can convey only half or incomplete information.

Value:- After taking the other Vs into account, there comes one more V, which stands for Value. The bulk of data has no value to the company unless you turn it into something useful. Data in itself is of no use or importance; it needs to be converted into something valuable to extract information. Hence, you can state that Value is the most important of all the 5 Vs.
 
Hadoop Daemons
  • NameNode
  • Secondary NameNode
  • DataNode
  • JobTracker
  • TaskTracker
